Att-GCN+MCTS [2021] Generalize a Small Pre-trained Model to Arbitrarily Large TSP Instances

Author: Zhen, TONG 2023-12-21
https://arxiv.org/pdf/2012.10658.pdf

Acknowledgments go to Prof. Zha and his dedicated students. This paper comes from research conducted at my alma mater, The Chinese University of Hong Kong, Shenzhen, and was published at AAAI 2021.

In this paper, the authors introduce an approach that fuses Supervised Learning (SL) and Reinforcement Learning (RL).

Method

This research is specifically tailored to the Euclidean Traveling Salesman Problem (TSP). First, the Supervised Learning (SL) component produces a heatmap of the graph: a probability assigned to each edge, as formally described below:

$$P_{ij} \in [0, 1]$$

The probability \(P_{ij}\) signifies the likelihood that the edge between nodes \(i\) and \(j\) belongs to the optimal tour. The researchers adopted an offline learning approach, training a graph convolutional residual neural network with attention mechanisms, abbreviated as Att-GCRN. This model was trained on 990,000 TSP instances, using the Concorde solver to generate the labels.
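As a rough illustration (not the authors' code), the supervision target and loss for such a heatmap could look like the sketch below; `tour_to_edge_labels` and `edge_cross_entropy` are hypothetical helper names.

```python
import numpy as np

def tour_to_edge_labels(tour, n):
    """Turn a ground-truth tour (a permutation of node ids, e.g. from
    Concorde) into a 0/1 edge-label matrix: y[i, j] = 1 iff edge (i, j)
    is on the tour."""
    y = np.zeros((n, n))
    for a, b in zip(tour, np.roll(tour, -1)):
        y[a, b] = y[b, a] = 1.0
    return y

def edge_cross_entropy(P, y, eps=1e-9):
    """Per-edge binary cross-entropy between a predicted heatmap P
    (entries in [0, 1]) and the ground-truth edge labels y."""
    return -np.mean(y * np.log(P + eps) + (1 - y) * np.log(1 - P + eps))
```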

After training the Att-GCRN on small graphs, the team sampled sub-graphs from larger graphs, generated a heatmap for each sub-graph, and then merged them into one comprehensive heatmap for the full graph. Finally, they employed Monte Carlo tree search (MCTS) to search for high-quality TSP solutions, guided by the information encoded in the merged heatmap.
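Putting the whole pipeline into pseudocode makes the structure clearer. This is only a sketch: `sample_subgraph` and `merge_heatmaps` are fleshed out in the sections below, while `att_gcrn` and `mcts_search` are placeholders for the trained model and the search procedure.

```python
import numpy as np

def solve_large_tsp(coords, att_gcrn, n_rounds, k):
    """Sketch of the overall pipeline: sample sub-graphs, predict one
    heatmap per sub-graph, merge them, then run heatmap-guided MCTS."""
    n = len(coords)
    O_node = np.zeros(n)                    # node visit counter for sampling
    subgraphs = []
    for _ in range(n_rounds):
        members, sub_coords = sample_subgraph(coords, O_node, k)
        subgraphs.append((members, att_gcrn(sub_coords)))
    P = merge_heatmaps(n, subgraphs)        # merged n-by-n heatmap
    return mcts_search(coords, P)           # search guided by the heatmap
```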

Sub-Graph Sampling

The sampling process is intuitively straightforward. For a given graph, one can simply select the least-visited nodes or edges together with their k-nearest-neighbor nodes. A visitation variable, denoted \(O_i\) for nodes or \(O_{ij}\) for edges, keeps track of how often each node or edge has been visited, and the heatmap is later computed based on these visitation records. This iterative process continues until a predetermined threshold is reached. Additionally, the graph conversion step, which essentially amounts to a scale transformation of the sub-graph, is a straightforward operation.
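A minimal sketch of one sampling round, assuming 2D coordinates in the unit square and a node-level visit counter (the variable names are mine, not the paper's):

```python
import numpy as np

def sample_subgraph(coords, O_node, k):
    """Take the least-visited node and its k nearest neighbours as a
    sub-graph, update the visit counter, and apply the scale
    transformation that maps the sub-graph back into the unit square."""
    center = int(np.argmin(O_node))                  # least-visited node
    d = np.linalg.norm(coords - coords[center], axis=1)
    members = np.argsort(d)[:k]                      # center plus neighbours
    O_node[members] += 1                             # record the visit
    sub = coords[members]
    lo = sub.min(axis=0)
    scale = (sub.max(axis=0) - lo).max()             # preserve aspect ratio
    return members, (sub - lo) / scale
```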

Merging

The story sounds good so far, except for the merging step. Actually, it is also very easy:

$$P_{ij} = \frac{1}{O_{ij}} \sum_{l=1}^{I} P''_{ij}(l)$$

Here \(I\) is the number of sampled sub-graphs and \(P''_{ij}(l)\) is the heatmap of sub-graph \(l\) converted back to the original coordinates; edges not covered by sub-graph \(l\) contribute nothing to the sum.
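In code, the merge is just an average over the sub-heatmaps covering each edge; a sketch with an assumed data layout (`members` maps sub-graph indices back to original node ids):

```python
import numpy as np

def merge_heatmaps(n, subgraphs):
    """Merge per-sub-graph heatmaps into one n-by-n heatmap.
    `subgraphs` is a list of (members, P_sub) pairs."""
    P = np.zeros((n, n))
    O = np.zeros((n, n))       # O_ij: how many sub-graphs cover edge (i, j)
    for members, P_sub in subgraphs:
        idx = np.ix_(members, members)
        P[idx] += P_sub
        O[idx] += 1
    # average where covered, leave uncovered edges at 0
    return np.divide(P, O, out=np.zeros_like(P), where=O > 0)
```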

Reinforcement Learning for Solution Optimization

They use k-opt actions: delete k edges from the current solution state and add k edges to form the new solution state. To avoid producing an infeasible solution, they use what the paper calls a compact method to represent these actions.

They first search for actions in a small neighborhood, i.e., k = 2, then enlarge the neighborhood with RL, because the full search space can be too big.
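For intuition, here is a minimal sketch of the k = 2 case: a 2-opt move deletes two edges and reverses the segment between them, which keeps the tour feasible by construction.

```python
def two_opt_move(tour, i, j):
    """Apply a 2-opt move to a tour (list of node ids): remove edges
    (tour[i], tour[i+1]) and (tour[j], tour[j+1]), then reconnect by
    reversing the segment in between. The result is always a valid tour."""
    assert 0 <= i < j < len(tour)
    return tour[:i + 1] + tour[i + 1:j + 1][::-1] + tour[j + 1:]
```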

The motivation for MCTS is the large search (state) space. The main idea of MCTS is to build a search tree of limited depth, simulate complete trajectories with a few samples, and estimate the rest of the space with the Monte Carlo method. To balance exploration and exploitation, it uses the UCB rule, which combines the visit counts of states with their estimated values. This paper does the same thing with two matrices \(W\) and \(Q\), corresponding to the estimated value and the visit counts; since the problem lives on a graph, both are \(n \times n\). They are called the weight matrix and the access matrix (initialized to 0), and \(Q_{ij}\) records the number of times edge \((i, j)\) is chosen during simulations. Additionally, a variable \(M\) (initialized to 0) records the total number of actions simulated so far.

As explained before, each action is represented as \(a = (a_1, b_1, a_2, b_2, \ldots, a_k, b_k, a_{k+1})\), containing a series of sub-decisions \(a_i\), \(1 \le i \le k\) (\(k\) is also a variable, and \(a_{k+1} = a_1\)), while \(b_i\) can be determined uniquely once \(a_i\) is known.

In the simulation step, they use a variable called the potential \(Z_{b_i, j}\), which is basically the UCB formula adapted to the graph setting of MCTS. The higher the value of \(Z_{b_i, j}\), the larger the chance that edge \((b_i, j)\) is chosen:

$$Z_{b_i, j} = \frac{W_{b_i, j}}{\Omega_{b_i}} + \alpha \sqrt{\frac{\ln(M + 1)}{Q_{b_i, j} + 1}}$$
$$\Omega_{b_i} = \frac{\sum_{j \neq b_i} W_{b_i, j}}{\sum_{j \neq b_i} 1}$$

\(\Omega_{b_i}\) is the average of \(W_{b_i, j}\) over all edges incident to vertex \(b_i\). After a leaf node in the MCTS is selected, a rollout follows, which the paper describes as a series of sub-decisions: randomly choose action \(a_i\), then apply the uniquely determined \(b_i\). Finally, the rollout is backpropagated, and the probability of choosing the next vertex is updated as:

$$P_j = \frac{Z_{b_i, j}}{\sum_{l \in \mathbb{X}} Z_{b_i, l}}$$
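The two formulas above translate into a few lines of code; a sketch assuming `W`, `Q`, and `M` as defined earlier and `candidates` as the index set \(\mathbb{X}\):

```python
import numpy as np

def selection_probabilities(W, Q, M, b_i, candidates, alpha):
    """Compute the potential Z_{b_i, j} for every candidate j and
    normalize into the selection probabilities P_j."""
    others = np.arange(len(W)) != b_i
    omega = max(W[b_i, others].mean(), 1e-9)   # Omega_{b_i}, guarded vs. 0
    Z = W[b_i, candidates] / omega \
        + alpha * np.sqrt(np.log(M + 1) / (Q[b_i, candidates] + 1))
    return Z / Z.sum()
```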

\(\mathbb{X}\) is the set of candidate nodes for the next edge to connect. If some choice at \(b_i\) leads to an improving state, keep it; otherwise, randomly choose one edge \((b_i, j)\). Whenever a state-action pair is examined, backpropagate to update the weight matrix \(W_{i, j}\):

$$W_{b_i, a_{i+1}} \leftarrow W_{b_i, a_{i+1}} + \beta \left[ \exp\left( \frac{L(s) - L(s_{new})}{L(s)} \right) - 1 \right]$$
$$Q_{b_i, a_{i+1}} \leftarrow Q_{b_i, a_{i+1}} + 1$$
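Here \(L(s)\) is the tour length of state \(s\) and \(s_{new}\) is the state after the move, so the weight grows only when the move shortens the tour. A sketch of one backpropagation step:

```python
import numpy as np

def backprop_update(W, Q, b_i, a_next, L_s, L_new, beta):
    """Update the weight and access matrices for the examined edge
    (b_i, a_{i+1}): reward tour-length improvement, count the access."""
    W[b_i, a_next] += beta * (np.exp((L_s - L_new) / L_s) - 1.0)
    Q[b_i, a_next] += 1
```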

Experiment

According to the paper's experiments, their solutions come very close to Concorde's in quality while being obtained much faster.